LunarLander

CNN

Train a supervised machine learning model to control the Lunar Lander craft based on the image dataset and perform a suitable evaluation experiment (based on the dataset) to determine how effective the model trained is.

Baseline

To establish an initial baseline, I loaded the data in its raw form, without normalising or resizing the images.

As seen above, the data is extremely unbalanced.

I am starting with a very basic model: one convolutional layer without padding, a max-pooling layer and two dense layers.
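
As a sketch, a baseline of this shape might look like the following. The filter count, dense width and input shape here are placeholders, not the values used in the notebook:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_baseline(input_shape=(128, 128, 3), n_classes=4):
    """One conv layer (default 'valid' padding, i.e. no padding),
    one max-pooling layer, then two dense layers."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), activation="relu"),  # padding defaults to 'valid'
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```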

Below I am defining functions to plot the loss and accuracy of models; this avoids a lot of repeated code when evaluating later models.
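
A minimal version of such a helper might look like this (the function name and figure layout are my own, assuming a Keras `History` object or a plain dict of metric lists):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for notebooks and scripts
import matplotlib.pyplot as plt

def plot_history(history, title=""):
    """Plot training/validation loss and accuracy side by side.
    Accepts a Keras History object or a dict with 'loss', 'val_loss',
    'accuracy' and 'val_accuracy' keys."""
    h = history.history if hasattr(history, "history") else history
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 4))
    ax_loss.plot(h["loss"], label="train")
    ax_loss.plot(h["val_loss"], label="validation")
    ax_loss.set_title("Loss")
    ax_loss.set_xlabel("epoch")
    ax_loss.legend()
    ax_acc.plot(h["accuracy"], label="train")
    ax_acc.plot(h["val_accuracy"], label="validation")
    ax_acc.set_title("Accuracy")
    ax_acc.set_xlabel("epoch")
    ax_acc.legend()
    fig.suptitle(title)
    return fig
```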

The model has visibly overfit from around the third epoch: training accuracy continues to grow while validation accuracy has dropped off.

We can see that the F1 score is significantly lower than the accuracy score, reflecting bias in the model due to the unbalanced data.

LeNet

Next I will reload the data, this time resizing and normalising it. I've decided to try the data out on a LeNet architecture, so I will be resizing the images to 32 × 32. Thanks to the decreased size of the images, training will be quicker, so I will be using the full training dataset available.
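
A dependency-free sketch of the preprocessing step (nearest-neighbour resizing via index sampling; a library resize such as OpenCV's or TensorFlow's would normally be used instead):

```python
import numpy as np

def preprocess(images, size=32):
    """Resize each image to size x size with nearest-neighbour sampling
    and normalise 8-bit pixel values into [0, 1]."""
    h, w = images.shape[1], images.shape[2]
    rows = np.arange(size) * h // size   # source row for each target row
    cols = np.arange(size) * w // size   # source column for each target column
    resized = images[:, rows[:, None], cols[None, :]]
    return resized.astype("float32") / 255.0
```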

Implementing the LeNet model below.

Training time has significantly decreased, despite using the entire dataset. Accuracy has increased on both the validation and test sets, and the F1 score has improved on the validation set, though it has actually dropped on the test set.

Data Augmentation with weighted classes

Class 1 and class 3 are being under-predicted due to the imbalance in the data. This can be seen by comparing the accuracy score to the lower F1 score. Data augmentation is a technique used to create new training data. I thought that by adding extra, augmented data, and manually weighting the classes in favour of the minority classes, we might alleviate some bias.
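
The class weights can be derived from the label frequencies; a small sketch of the inverse-frequency heuristic (the same idea as scikit-learn's "balanced" mode, though the notebook's exact weights may have been chosen by hand):

```python
import numpy as np

def inverse_frequency_weights(y):
    """Weight each class inversely to its frequency, normalised so the
    weights average to 1 across samples. The resulting dict can be
    passed to Keras model.fit(..., class_weight=...)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))
```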

Interestingly the validation set is performing quite a bit better than the training set.

Unfortunately this model did not perform well when presented with un-augmented data. Although accuracy did not drop too much, the F1 scores on both the validation and test sets dropped compared with the previous model.

Unweighted Data Augmentation

The weighting of classes to address bias has not had the desired effect; let's see if data augmentation without class-weighting improves performance.

Again, the validation set is performing well on the augmented data.

Although this model has performed better than the weighted model, it is still underperforming compared to the model trained on un-augmented data.

Balanced Data Generator

When researching data augmentation for unbalanced data, I came upon an article which uses the balanced_batch_generator from the imbalanced-learn library to rebalance the dataset. The "BalancedDataGenerator" code below is taken from that article: https://medium.com/analytics-vidhya/how-to-apply-data-augmentation-to-deal-with-unbalanced-datasets-in-20-lines-of-code-ada8521320c9

This worked quite well comparatively. Although the accuracy score is lower than Model 3 and Model 4, the F1 score on both the test and validation sets is high. This was very slow to run though, with training taking over four times longer than Model 2, which has comparable F1 scores.

Random Under Sampling

I then decided to try rebalancing the dataset with random undersampling. As I am using the entire dataset to train, I am hoping that undersampling does not hurt accuracy or introduce new biases.
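
Random undersampling is simple enough to sketch in plain NumPy (in practice a library helper such as imbalanced-learn's RandomUnderSampler does the same job):

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Randomly drop samples from the majority classes until every class
    has as many samples as the smallest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid class-ordered batches
    return X[keep], y[keep]
```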

The dataset is now balanced.

With Data Augmentation

Without Data Augmentation

Undersampling makes the models very fast to train, but unfortunately we can see there has been a drop in both accuracy and F1 scores. Of the two undersampled models, the model with data augmentation performed much better than the one trained without augmentation.

With Dropout

In order to try and reduce the effects of overfitting, I decided to add some dropout to the model.
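
A sketch of a LeNet-style stack with dropout after the dense layers; the dropout rate of 0.3 is a placeholder, not a value tuned in the notebook:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_lenet_dropout(input_shape=(32, 32, 3), n_classes=4, rate=0.3):
    """LeNet-style CNN with Dropout after each dense layer to curb overfitting."""
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(6, (5, 5), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(16, (5, 5), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(120, activation="relu"),
        layers.Dropout(rate),   # randomly zero activations during training
        layers.Dense(84, activation="relu"),
        layers.Dropout(rate),
        layers.Dense(n_classes, activation="softmax"),
    ])
```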

Upon visual inspection, it seems that overfitting has been reduced, the training and validation curves are following similar trends.

Unfortunately the dropout has not improved performance on the validation and test sets.

SMOTE

As undersampling was not giving the desired performance improvement, I decided to try oversampling with SMOTE.
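
The idea behind SMOTE can be sketched in a few lines of NumPy: synthetic minority samples are interpolated between a real minority point and one of its nearest minority neighbours. This is an illustrative sketch for a single class on flattened feature vectors; a library implementation such as imbalanced-learn's SMOTE is what would normally be used:

```python
import numpy as np

def smote_minority(X, y, minority_class, n_new, k=5, seed=0):
    """Synthesise n_new samples for one minority class by interpolating
    between a random minority point and one of its k nearest minority
    neighbours. X must be 2-D (images flattened to vectors)."""
    rng = np.random.default_rng(seed)
    Xm = X[y == minority_class]
    # pairwise distances within the minority class
    d = np.linalg.norm(Xm[:, None] - Xm[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest neighbours per point
    base = rng.integers(len(Xm), size=n_new)
    nb = neighbours[base, rng.integers(k, size=n_new)]
    gap = rng.random((n_new, 1))                # interpolation factor in [0, 1)
    X_new = Xm[base] + gap * (Xm[nb] - Xm[base])
    return (np.vstack([X, X_new]),
            np.concatenate([y, np.full(n_new, minority_class)]))
```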

I initially ran this for 20 epochs like the other models, but loss on the validation set was still falling, and it didn't look overfit. I decided to give it another 30 epochs to see where it went.

SMOTE has outperformed the undersampled data, giving us the highest F1 score we have seen for the test set. But training was extremely slow.

Sequential Input

In order to give the model a better chance of capturing the direction of the rocket in motion, I will stack images together to create sequential input to train on. I have decided to stack three images at a time.
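
Stacking consecutive frames along the channel axis is a small NumPy operation; a sketch of the idea (the label of the last frame in each stack is kept, which is one reasonable choice):

```python
import numpy as np

def stack_frames(images, labels, depth=3):
    """Stack `depth` consecutive frames along the channel axis so the
    model can see motion. Returns (n - depth + 1) stacks, each labelled
    with the label of its final frame."""
    n = len(images)
    stacks = np.concatenate(
        [images[i:n - depth + 1 + i] for i in range(depth)], axis=-1)
    return stacks, labels[depth - 1:]
```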

Although the test set accuracy is the highest we've seen, the F1 scores are low. The sequential input is not alleviating bias.

Sequential with Smote

In order to hopefully raise F1 scores, we will oversample the sequential input to balance the dataset.

Results are disappointing. There has been a small bump in the F1 scores, but they are still lower than previous models with single frame input.

Pre-trained VGG

Given the simplicity of the images we are presenting to the network, a pre-trained model is probably overkill, but out of curiosity I decided to try it out.

CNN Model Comparisons

The pretrained model actually degraded performance quite a bit, and it was very slow.

There was no definitive winner per se, but I think Model 9 is the most well-rounded model. It has a fair accuracy score (above 50%) and F1 scores at the higher end of what we have seen. It also, incidentally, had the longest training time. I have decided to use this model to compare against my reinforcement learning model.

Reinforcement Learning

RL baseline

In establishing a baseline, I decided to use a relatively simple model architecture and let it run for a while. The memory limit is 50,000, while the step limit is 5,000,000. This is a big discrepancy, but I thought it would be an interesting place to start, and at the end of this long training run it should be clear whether the memory limit is sufficient or not.
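
The 50,000-transition memory limit behaves like a fixed-size ring buffer: once full, the oldest experience is overwritten. A minimal sketch of what keras-rl's SequentialMemory limit amounts to (the class name and interface here are my own, for illustration):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size experience replay buffer. With limit=50_000, once the
    buffer is full each new transition evicts the oldest one."""
    def __init__(self, limit=50_000):
        self.buffer = deque(maxlen=limit)

    def append(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        """Draw a random minibatch of stored transitions for training."""
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```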

The model has performed very poorly: the reward has rarely made it above zero, staying firmly in the negative numbers.

Before adjusting the memory and log parameters, I decided to modify the network architecture slightly, to see whether adding more connections (and complexity) might allow the network to learn better over a long period.

It has behaved pretty similarly to the first model. Performance is poor.

Increasing Window Length

There isn't a lot of documentation for Keras-RL, but from what I've understood, the window-length parameter controls how many samples are concatenated to form a "state". I believe setting this to 4 is somewhat equivalent to stacking four images for the CNN. I have reduced training to 250,000 steps, as it was simply taking too long.

When the window size is changed, the model architecture also needs a slight adjustment, as the initial "Flatten" layer's input shape must reflect the window length.
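
A sketch of how that dependency looks in code. LunarLander has an 8-dimensional observation and 4 discrete actions; the layer sizes below are placeholders (in Keras 2 / keras-rl code the input shape would typically be given directly as Flatten(input_shape=(window_length,) + env.observation_space.shape)):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dqn_net(window_length=4, obs_dim=8, n_actions=4):
    """Q-network whose input shape is (window_length, obs_dim): the
    first Flatten layer must change whenever the window length does."""
    return keras.Sequential([
        keras.Input(shape=(window_length, obs_dim)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_actions, activation="linear"),  # one Q-value per action
    ])
```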

Performance has improved significantly. Although mean reward levels are still hovering around zero, we are seeing far fewer of the big negative numbers recorded by the previous models, and some positive numbers are now appearing.

Here I have increased window size again, keeping the same memory and step limits.

Increasing the window size from 4 to 6 has improved performance again. Although the rolling average is still around zero, we see big numbers, like +200, for the first time. It is clear, though, that this performance is unstable; we are still getting some -400s and -600s. Given that the model only trained for 250,000 steps, I think this is acceptable.

Increasing Training Time

I will now let this model train for 2,000,000 steps. We will see if the instability resolves with a longer training time.

The longer training time didn't do much for the model. The average has remained pretty steady, and we still see some big negative numbers even close to the end of training.

I decided to see if longer training would have a similar result with a window size of 4.

Again, the average hovers around zero, and we are still seeing big negative numbers near the end of training.

Increasing Sequential Memory Limit

I felt that the next logical step was to increase memory size, hoping that this would allow the model to better capture some of the complexities of the game. I also increased the log interval, as this felt like a sensible step to take when increasing the sequential memory limit.

We see an immediate positive result. Although the rolling average hasn't increased massively, we are seeing many more large positive numbers.

I decided to increase both the memory and the log interval again, to see if this would continue to improve performance.

The rolling average is now sitting between +100 and +200 reward, a massive leap from our previous model.

Out of curiosity I decided to see if we would get a similar performance improvement with a window length of 4.

No comparable performance bump is seen. It is clear that the window length of 6 is key to our model's improvement, and not just the increase in memory size and log interval.

Test RL Models

As a final step I would like to test each trained model for 20 episodes and look at the average reward. As keras-rl prints reward to stdout when testing, and has no facility to redirect this output, I will redirect stdout to text files and then read the model output back in from those files.
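
A sketch of the redirect-and-parse step using the standard library. The function name is my own, and the regular expression assumes keras-rl's per-episode log format (lines like "Episode 1: reward: -42.170, steps: 180"), which may need adjusting:

```python
import re
from contextlib import redirect_stdout

def capture_test_log(run_test, path):
    """Redirect whatever `run_test` prints (e.g. keras-rl's per-episode
    rewards) into a text file, then parse the reward values back out."""
    with open(path, "w") as f, redirect_stdout(f):
        run_test()  # e.g. lambda: dqn.test(env, nb_episodes=20)
    with open(path) as f:
        return [float(m.group(1))
                for m in re.finditer(r"reward: (-?\d+\.?\d*)", f.read())]
```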

It is clear that model 8, with a window_length of 6 and the largest memory limit and log interval, has performed the best. All other models have a negative reward average. It is also interesting to note that this model trained for only 250,000 steps, yet significantly outperforms models that trained for 5,000,000 steps. This serves to highlight the importance of parameter selection.

Comparison

Deploy each of the two models trained to the Lunar Lander game to play 200 episodes and analyse the reward achieved by the models trained using each approach.

First I will test the CNN model on 200 episodes and save the returned reward.

Now I shall deploy the RL model for 200 episodes and save the reward.

Here we can see the RL model's reward plotted against the CNN model's reward. The RL model consistently outperforms the CNN model.

Here is another view of the reward distribution for the two models. RL's better performance is quite clear. But we can also see that the CNN is less stable: it has a much larger range of reward, judging by the distance between its minimum and maximum whiskers, and it has a significant outlier. The RL model has a smaller whisker range and no outliers.

Discussion

Approach
Results and Performance
Computational Cost